# Zero-shot Image Classification

**FG-CLIP Base** (qihoo360, Apache-2.0)
FG-CLIP is a fine-grained vision-language alignment model that achieves both global and region-level image-text alignment through two-stage training.
Text-to-Image · Transformers · English · 692 downloads · 2 likes

**OpenVision ViT Base Patch16 224** (UCSC-VLAA, Apache-2.0)
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
Multimodal Fusion · 79 downloads · 0 likes

**OpenVision ViT Large Patch14 224** (UCSC-VLAA, Apache-2.0)
OpenVision is a fully open, cost-effective family of advanced vision encoders for multimodal learning.
Multimodal Fusion · 308 downloads · 4 likes

**ViT-gopt-16-SigLIP2-256** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 43.20k downloads · 0 likes

**ViT-SO400M-14-SigLIP2** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 1,178 downloads · 0 likes

**ViT-L-16-SigLIP2-384** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 581 downloads · 0 likes

**ViT-B-16-SigLIP2** (timm, Apache-2.0)
A SigLIP 2 vision-language model trained on the WebLI dataset, suitable for zero-shot image classification.
Text-to-Image · 11.26k downloads · 0 likes
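
The timm SigLIP 2 checkpoints above are typically loaded through open_clip for zero-shot classification. The snippet below is a minimal sketch that assumes the ViT-B-16-SigLIP2 weights are published under the hub id `timm/ViT-B-16-SigLIP2` and load via open_clip's `hf-hub:` support; swap in whichever variant you need.

```python
import torch
import open_clip
from PIL import Image

# Assumed hub id; substitute the SigLIP 2 variant you actually want.
MODEL_ID = "hf-hub:timm/ViT-B-16-SigLIP2"

model, preprocess = open_clip.create_model_from_pretrained(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # SigLIP models are trained with a sigmoid loss, but a softmax over cosine
    # similarities is still a convenient way to rank candidate labels.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```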

**siglip2-so400m-patch16-naflex** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pre-training objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 159.81k downloads · 21 likes

**siglip2-base-patch16-naflex** (google, Apache-2.0)
SigLIP 2 is a multilingual vision-language encoder that builds on SigLIP's pretraining objective with new training schemes, improving semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 10.68k downloads · 5 likes

**siglip2-so400m-patch16-512** (google, Apache-2.0)
SigLIP 2 is a SigLIP-based vision-language model with improved semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 46.46k downloads · 18 likes

**siglip2-so400m-patch16-384** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pre-training objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 7,632 downloads · 2 likes

**siglip2-so400m-patch16-256** (google, Apache-2.0)
SigLIP 2 is an improved version of SigLIP that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 2,729 downloads · 0 likes

**siglip2-giant-opt-patch16-384** (google, Apache-2.0)
SigLIP 2 improves on the SigLIP pretraining objective, combining several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 26.12k downloads · 14 likes

**siglip2-giant-opt-patch16-256** (google, Apache-2.0)
SigLIP 2 is an advanced vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 3,936 downloads · 1 like

**siglip2-large-patch16-384** (google, Apache-2.0)
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, with stronger semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 6,525 downloads · 2 likes

**siglip2-large-patch16-256** (google, Apache-2.0)
SigLIP 2 is an improved SigLIP-based vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 10.89k downloads · 3 likes

**siglip2-base-patch16-512** (google, Apache-2.0)
SigLIP 2 is a vision-language model that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 28.01k downloads · 10 likes

**siglip2-base-patch16-384** (google, Apache-2.0)
SigLIP 2 is a SigLIP-based vision-language model that improves semantic understanding, localization, and dense feature extraction through a unified training recipe.
Image-to-Text · Transformers · 4,832 downloads · 5 likes

**siglip2-base-patch16-256** (google, Apache-2.0)
SigLIP 2 is a multilingual vision-language encoder with improved semantic understanding, localization, and dense feature extraction.
Image-to-Text · Transformers · 45.24k downloads · 4 likes

**siglip2-base-patch16-224** (google, Apache-2.0)
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, with stronger semantic understanding, localization, and dense feature extraction.
Image-to-Text · Transformers · 44.75k downloads · 38 likes

**siglip2-base-patch32-256** (google, Apache-2.0)
SigLIP 2 is an improved version of SigLIP that combines several techniques to strengthen semantic understanding, localization, and dense feature extraction.
Text-to-Image · Transformers · 9,419 downloads · 4 likes
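
For the Google SigLIP 2 checkpoints listed above, the quickest route is the Transformers zero-shot image classification pipeline. A minimal sketch, assuming the base fixed-resolution checkpoint is available as `google/siglip2-base-patch16-224` and that the installed transformers version includes SigLIP 2 support (the NaFlex variants use their own variable-resolution processing, so check their model cards):

```python
from transformers import pipeline

# Assumed repo id for the base 224px checkpoint; the other fixed-resolution
# SigLIP 2 variants in this list should work the same way.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

predictions = classifier(
    "example.jpg",  # local path, URL, or PIL.Image
    candidate_labels=["a cat", "a dog", "a car"],
)
print(predictions)  # list of {"label": ..., "score": ...}, highest score first
```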

**mmE5-mllama-11b-instruct** (intfloat, MIT)
mmE5 is a multimodal, multilingual embedding model built on Llama-3.2-11B-Vision; it improves embedding quality with high-quality synthetic training data and achieves state-of-the-art results on the MMEB benchmark.
Multimodal Fusion · Transformers · multilingual · 596 downloads · 18 likes
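
mmE5 is an embedding model rather than a classifier, so it is normally queried for pooled multimodal embeddings. The code below is only a generic sketch of last-token pooling over the hidden states of a Llama-3.2-Vision-style backbone; the repo id `intfloat/mmE5-mllama-11b-instruct`, the use of `MllamaForConditionalGeneration`, and the pooling choice are assumptions here, and the model card's exact prompt template and pooling take precedence.

```python
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, MllamaForConditionalGeneration

MODEL_ID = "intfloat/mmE5-mllama-11b-instruct"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Text-only query; image inputs would go through the same processor call.
inputs = processor(
    text="Represent this sentence: a photo of a cat",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)
    last_hidden = outputs.hidden_states[-1]              # (batch, seq_len, hidden_dim)
    embedding = F.normalize(last_hidden[:, -1], dim=-1)  # last-token pooling (generic choice)

print(embedding.shape)
```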

**GenMedClip B 16 PMB** (wisdomik, MIT)
A zero-shot image classification model built on the open_clip library, specialized for medical image analysis.
Image Classification · 408 downloads · 0 likes

**GenMedClip** (wisdomik, MIT)
GenMedClip is a zero-shot image classification model built on the open_clip library, specialized for medical image analysis.
Image Classification · 40 downloads · 0 likes
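
Both GenMedClip entries are built on open_clip, so zero-shot medical image classification follows the usual open_clip recipe. The hub id below (`wisdomik/GenMedClip`) is a hypothetical placeholder; check the model card for the exact id and tokenizer name.

```python
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

HUB_ID = "hf-hub:wisdomik/GenMedClip"  # hypothetical id; verify on the model card

model, _, preprocess = open_clip.create_model_and_transforms(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

image = preprocess(Image.open("chest_xray.png").convert("RGB")).unsqueeze(0)
labels = ["a chest X-ray showing pneumonia", "a normal chest X-ray"]
text = tokenizer(labels)

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(text), dim=-1)
    scores = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity per label

print(dict(zip(labels, scores.tolist())))
```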

**CLIP ViT-L-14 Spectrum Icons 20k** (JianLiao, MIT)
A vision-language model fine-tuned from CLIP ViT-L/14, optimized for abstract image-text retrieval.
Text-to-Image · TensorBoard · English · 1,576 downloads · 1 like
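
Because this checkpoint is a fine-tune of CLIP ViT-L/14 aimed at image-text retrieval, it can presumably be scored with the standard Transformers CLIP classes. A sketch under that assumption; the repo id `JianLiao/CLIP-ViT-L-14-spectrum-icons-20k` and the standard CLIP weight layout are assumptions here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"  # assumed repo id

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model.eval()

images = [Image.open("icon_a.png"), Image.open("icon_b.png")]
queries = ["a minimalist lightning bolt icon", "a gear-shaped settings icon"]

inputs = processor(text=queries, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (num_queries, num_images); higher means a better match.
print(outputs.logits_per_text.softmax(dim=-1))
```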

**eva02_large_patch14_clip_336.merged2b** (timm, MIT)
EVA02 CLIP is a large-scale vision-language model built on the CLIP architecture, supporting tasks such as zero-shot image classification.
Text-to-Image · 197 downloads · 0 likes

**vit_so400m_patch16_siglip_256.webli_i18n** (timm, Apache-2.0)
A SigLIP-based Vision Transformer for image feature extraction, containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 15 downloads · 0 likes

**vit_so400m_patch14_siglip_gap_384.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer image encoder that uses global average pooling for image features.
Image Classification · Transformers · 96 downloads · 0 likes

**vit_so400m_patch14_siglip_378.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 82 downloads · 0 likes

**vit_base_patch16_siglip_512.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 702 downloads · 0 likes

**vit_so400m_patch14_siglip_224.webli** (timm, Apache-2.0)
A SigLIP-based Vision Transformer containing only the image encoder and using the original attention pooling.
Image Classification · Transformers · 123 downloads · 1 like
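
The `.webli` entries above ship only the SigLIP image tower, so they are used through timm for image feature extraction rather than for zero-shot classification. A minimal sketch, using `vit_base_patch16_siglip_512.webli` as the example name; the other encoder-only timm checkpoints in this list follow the same pattern.

```python
import timm
import torch
from PIL import Image

# Encoder-only checkpoint; num_classes=0 returns pooled features instead of logits.
model = timm.create_model("vit_base_patch16_siglip_512.webli",
                          pretrained=True, num_classes=0)
model.eval()

# Build preprocessing that matches the checkpoint's pretraining configuration.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(image)  # shape: (1, embed_dim)

print(features.shape)
```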

**vit_large_patch14_clip_224.datacompxl** (timm, Apache-2.0)
A CLIP-architecture Vision Transformer for image feature extraction, released by the LAION organization.
Image Classification · Transformers · 14 downloads · 0 likes

**vit_giant_patch14_clip_224.laion2b** (timm, Apache-2.0)
A CLIP-architecture Vision Transformer for image feature extraction, trained on the LAION-2B dataset.
Image Classification · Transformers · 71 downloads · 0 likes

**convnext_base.clip_laion2b_augreg** (timm, Apache-2.0)
A ConvNeXt-Base image encoder from the CLIP framework, trained on the LAION-2B dataset and suitable for image feature extraction.
Image Classification · Transformers · 522 downloads · 0 likes

**convnext_base.clip_laion2b** (timm, Apache-2.0)
A ConvNeXt-based CLIP image encoder trained by LAION, suitable for multimodal vision-language tasks.
Image Classification · Transformers · 297 downloads · 0 likes

**CLIP-SAE-ViT-L-14** (zer0int, MIT)
A CLIP model fine-tuned with a sparse autoencoder (SAE), excelling at zero-shot image classification and particularly adept at handling adversarial typographic attacks.
Text-to-Image · Transformers · 32 downloads · 29 likes

**vit_large_patch14_clip_224.laion400m_e31** (timm, MIT)
A large Vision Transformer trained on the LAION-400M dataset, supporting zero-shot image classification.
Image Classification · 21.15k downloads · 0 likes

**vit_base_patch16_plus_clip_240.laion400m_e31** (timm, MIT)
A vision-language model trained on the LAION-400M dataset, supporting zero-shot image classification.
Image Classification · 37.23k downloads · 0 likes

**vit_base_patch16_clip_224.metaclip_400m** (timm)
A vision model trained on the MetaCLIP-400M dataset, compatible with both the OpenCLIP and timm frameworks.
Image Classification · 1,206 downloads · 1 like

**vit_base_patch16_clip_224.laion400m_e32** (timm, MIT)
A Vision Transformer trained on the LAION-400M dataset, compatible with both the open_clip and timm frameworks.
Image Classification · 5,751 downloads · 0 likes